Overview
Welcome to the 2024 Kaggle Playground Series! We plan to continue in the spirit of previous playgrounds, providing interesting and approachable datasets for our community to practice their machine learning skills, and we anticipate a competition each month.
Your Goal: The goal of this competition is to predict the probability of a region flooding based on various factors.
@misc{playground-series-s4e5, author = {Walter Reade, Ashley Chow}, title = {Regression with a Flood Prediction Dataset}, publisher = {Kaggle}, year = {2024}, url = {https://kaggle.com/competitions/playground-series-s4e5} }
Step 1: Data Loading
We'll load the training and test sets from their CSV files into Pandas DataFrames.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Load the dataset
df = pd.read_csv("train.csv")
test_df = pd.read_csv("test.csv")
# Display the first few rows of the dataframe
print(df.head())
print(test_df.head())
id MonsoonIntensity TopographyDrainage RiverManagement Deforestation \
0 0 5 8 5 8
1 1 6 7 4 4
2 2 6 5 6 7
3 3 3 4 6 5
4 4 5 3 2 6
Urbanization ClimateChange DamsQuality Siltation AgriculturalPractices \
0 6 4 4 3 3
1 8 8 3 5 4
2 3 7 1 5 4
3 4 8 4 7 6
4 4 4 3 3 3
... DrainageSystems CoastalVulnerability Landslides Watersheds \
0 ... 5 3 3 5
1 ... 7 2 0 3
2 ... 7 3 7 5
3 ... 2 4 7 4
4 ... 2 2 6 6
DeterioratingInfrastructure PopulationScore WetlandLoss \
0 4 7 5
1 5 3 3
2 6 8 2
3 4 6 5
4 4 1 2
InadequatePlanning PoliticalFactors FloodProbability
0 7 3 0.445
1 4 3 0.450
2 3 3 0.530
3 7 5 0.535
4 3 5 0.415
[5 rows x 22 columns]
id MonsoonIntensity TopographyDrainage RiverManagement \
0 1117957 4 6 3
1 1117958 4 4 2
2 1117959 1 3 6
3 1117960 2 4 4
4 1117961 6 3 2
Deforestation Urbanization ClimateChange DamsQuality Siltation \
0 5 6 7 8 7
1 9 5 5 4 7
2 5 7 2 4 6
3 6 4 5 4 3
4 4 6 4 5 5
AgriculturalPractices ... IneffectiveDisasterPreparedness \
0 8 ... 8
1 5 ... 2
2 4 ... 7
3 4 ... 7
4 3 ... 4
DrainageSystems CoastalVulnerability Landslides Watersheds \
0 5 7 5 6
1 4 7 4 5
2 9 2 5 5
3 8 4 6 7
4 3 2 6 4
DeterioratingInfrastructure PopulationScore WetlandLoss \
0 3 6 4
1 1 7 4
2 2 3 6
3 6 4 2
4 6 8 4
InadequatePlanning PoliticalFactors
0 4 5
1 4 3
2 8 3
3 4 4
4 5 5
[5 rows x 21 columns]
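The two previews show 22 columns for the training data and 21 for the test data, the difference being the FloodProbability target. A minimal sketch of how that alignment could be checked programmatically follows; the helper name `split_columns` and the tiny stand-in frames are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd


def split_columns(train, test, target="FloodProbability"):
    """Return (shared feature columns, train-only columns, test-only columns).

    Hypothetical helper for illustration; the target column is excluded
    from the comparison.
    """
    train_cols = [c for c in train.columns if c != target]
    shared = [c for c in train_cols if c in test.columns]
    only_train = [c for c in train_cols if c not in test.columns]
    only_test = [c for c in test.columns if c not in train.columns]
    return shared, only_train, only_test


# Tiny stand-in frames that reuse a few of the real column names
train_demo = pd.DataFrame(
    {"id": [0, 1], "MonsoonIntensity": [5, 6], "FloodProbability": [0.445, 0.450]}
)
test_demo = pd.DataFrame({"id": [1117957, 1117958], "MonsoonIntensity": [4, 4]})

shared, only_train, only_test = split_columns(train_demo, test_demo)
print("shared:", shared)
print("only in train:", only_train)
print("only in test:", only_test)
```

On the real data, the same call would be `split_columns(df, test_df)`; two empty "only in" lists would suggest the frames line up apart from the target.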
Step 2: Data Exploration
We'll check the basic information about the dataset to understand its structure and features.
# Check the shape of the dataframe
print("Shape of the dataframe:", df.shape)
# Check the column names and data types
print("Columns and data types:")
print(df.dtypes)
# Check for any missing values
print("Missing values:")
print(df.isnull().sum())
# Summary statistics
print("Summary statistics:")
print(df.describe())
Shape of the dataframe: (1117957, 22)
Columns and data types:
id int64
MonsoonIntensity int64
TopographyDrainage int64
RiverManagement int64
Deforestation int64
Urbanization int64
ClimateChange int64
DamsQuality int64
Siltation int64
AgriculturalPractices int64
Encroachments int64
IneffectiveDisasterPreparedness int64
DrainageSystems int64
CoastalVulnerability int64
Landslides int64
Watersheds int64
DeterioratingInfrastructure int64
PopulationScore int64
WetlandLoss int64
InadequatePlanning int64
PoliticalFactors int64
FloodProbability float64
dtype: object
Missing values:
id 0
MonsoonIntensity 0
TopographyDrainage 0
RiverManagement 0
Deforestation 0
Urbanization 0
ClimateChange 0
DamsQuality 0
Siltation 0
AgriculturalPractices 0
Encroachments 0
IneffectiveDisasterPreparedness 0
DrainageSystems 0
CoastalVulnerability 0
Landslides 0
Watersheds 0
DeterioratingInfrastructure 0
PopulationScore 0
WetlandLoss 0
InadequatePlanning 0
PoliticalFactors 0
FloodProbability 0
dtype: int64
Summary statistics:
id MonsoonIntensity TopographyDrainage RiverManagement \
count 1.117957e+06 1.117957e+06 1.117957e+06 1.117957e+06
mean 5.589780e+05 4.921450e+00 4.926671e+00 4.955322e+00
std 3.227265e+05 2.056387e+00 2.093879e+00 2.072186e+00
min 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
25% 2.794890e+05 3.000000e+00 3.000000e+00 4.000000e+00
50% 5.589780e+05 5.000000e+00 5.000000e+00 5.000000e+00
75% 8.384670e+05 6.000000e+00 6.000000e+00 6.000000e+00
max 1.117956e+06 1.600000e+01 1.800000e+01 1.600000e+01
Deforestation Urbanization ClimateChange DamsQuality Siltation \
count 1.117957e+06 1.117957e+06 1.117957e+06 1.117957e+06 1.117957e+06
mean 4.942240e+00 4.942517e+00 4.934093e+00 4.955878e+00 4.927791e+00
std 2.051689e+00 2.083391e+00 2.057742e+00 2.083063e+00 2.065992e+00
min 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
25% 4.000000e+00 3.000000e+00 3.000000e+00 4.000000e+00 3.000000e+00
50% 5.000000e+00 5.000000e+00 5.000000e+00 5.000000e+00 5.000000e+00
75% 6.000000e+00 6.000000e+00 6.000000e+00 6.000000e+00 6.000000e+00
max 1.700000e+01 1.700000e+01 1.700000e+01 1.600000e+01 1.600000e+01
AgriculturalPractices ... DrainageSystems CoastalVulnerability \
count 1.117957e+06 ... 1.117957e+06 1.117957e+06
mean 4.942619e+00 ... 4.946893e+00 4.953999e+00
std 2.068545e+00 ... 2.072333e+00 2.088899e+00
min 0.000000e+00 ... 0.000000e+00 0.000000e+00
25% 3.000000e+00 ... 4.000000e+00 3.000000e+00
50% 5.000000e+00 ... 5.000000e+00 5.000000e+00
75% 6.000000e+00 ... 6.000000e+00 6.000000e+00
max 1.600000e+01 ... 1.700000e+01 1.700000e+01
Landslides Watersheds DeterioratingInfrastructure \
count 1.117957e+06 1.117957e+06 1.117957e+06
mean 4.931376e+00 4.929032e+00 4.925907e+00
std 2.078287e+00 2.082395e+00 2.064813e+00
min 0.000000e+00 0.000000e+00 0.000000e+00
25% 3.000000e+00 3.000000e+00 3.000000e+00
50% 5.000000e+00 5.000000e+00 5.000000e+00
75% 6.000000e+00 6.000000e+00 6.000000e+00
max 1.600000e+01 1.600000e+01 1.700000e+01
PopulationScore WetlandLoss InadequatePlanning PoliticalFactors \
count 1.117957e+06 1.117957e+06 1.117957e+06 1.117957e+06
mean 4.927520e+00 4.950859e+00 4.940587e+00 4.939004e+00
std 2.074176e+00 2.068696e+00 2.081123e+00 2.090350e+00
min 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
25% 3.000000e+00 4.000000e+00 3.000000e+00 3.000000e+00
50% 5.000000e+00 5.000000e+00 5.000000e+00 5.000000e+00
75% 6.000000e+00 6.000000e+00 6.000000e+00 6.000000e+00
max 1.800000e+01 1.900000e+01 1.600000e+01 1.600000e+01
FloodProbability
count 1.117957e+06
mean 5.044803e-01
std 5.102610e-02
min 2.850000e-01
25% 4.700000e-01
50% 5.050000e-01
75% 5.400000e-01
max 7.250000e-01
[8 rows x 22 columns]
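The printed summary wraps across several blocks and elides some columns with "...". One possible way to get a more compact view is to transpose the result of `describe()` so each feature occupies a single row. The sketch below uses a small stand-in frame built from the first five training rows shown earlier; dropping `id` from the summary is a choice made here, not something the original notebook does.

```python
import pandas as pd

# Stand-in frame built from the first five training rows printed in Step 1
demo = pd.DataFrame(
    {
        "id": [0, 1, 2, 3, 4],
        "MonsoonIntensity": [5, 6, 6, 3, 5],
        "TopographyDrainage": [8, 7, 5, 4, 3],
        "FloodProbability": [0.445, 0.450, 0.530, 0.535, 0.415],
    }
)

# One row per column, keeping only the statistics of interest
summary = demo.drop(columns="id").describe().T[["mean", "std", "min", "max"]]

# to_string() prints every row without truncation
print(summary.to_string())
```

On the full dataset, `df.drop(columns="id").describe().T` would give the same layout for all features and the target.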
Step 3: Exploratory Data Analysis (EDA)
We'll perform some visualizations to gain insights into the data and understand the relationships between different features.
# Visualize the distribution of the target variable
plt.figure(figsize=(8, 6))
sns.histplot(df['FloodProbability'], kde=True)
plt.title('Distribution of Flood Probability')
plt.xlabel('Flood Probability')
plt.ylabel('Frequency')
plt.show()
# Visualize the correlation matrix
plt.figure(figsize=(12, 10))
sns.heatmap(df.corr(), annot=True, cmap='coolwarm', fmt=".2f")
plt.title('Correlation Matrix')
plt.show()
# Visualize the distribution of each column except the target (7 x 3 grid)
plt.figure(figsize=(15, 20))
for i, column in enumerate(df.columns[:-1]):
    plt.subplot(7, 3, i + 1)
    sns.histplot(df[column], kde=True)
    plt.title(f'Distribution of {column}')
    plt.xlabel(column)
    plt.ylabel('Frequency')
plt.tight_layout()
plt.show()
# Pairplot to visualize relationships between features
#sns.pairplot(df, vars=df.columns[:-1], diag_kind='kde')
#plt.suptitle('Pairplot of Features', y=1.02)
#plt.show()
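A 22 x 22 annotated heatmap can be hard to read. As a complement, the correlations with the target alone can be ranked in a single Series. The sketch below runs on synthetic data (the coefficients and the random features are made up for illustration) and drops `id`, which is an assumption made here because an identifier is not expected to carry signal.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data: two integer features and a target that depends
# mostly on the first one (coefficients chosen only for illustration)
rng = np.random.default_rng(0)
n = 1000
demo = pd.DataFrame(
    {
        "id": np.arange(n),
        "MonsoonIntensity": rng.integers(0, 10, n),
        "Urbanization": rng.integers(0, 10, n),
    }
)
demo["FloodProbability"] = (
    0.3 + 0.02 * demo["MonsoonIntensity"] + 0.005 * demo["Urbanization"]
)

# Correlation of every feature with the target, strongest first
target_corr = (
    demo.drop(columns="id")
    .corr()["FloodProbability"]
    .drop("FloodProbability")
    .sort_values(ascending=False)
)
print(target_corr)
```

With the real data, replacing `demo` by `df` would rank all twenty features at once.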
Scatter Plot for Specific Features and Related Visualizations
# Scatter plot to visualize the relationship between MonsoonIntensity and FloodProbability
plt.figure(figsize=(8, 6))
sns.scatterplot(x='MonsoonIntensity', y='FloodProbability', data=df)
plt.title('Scatter plot of MonsoonIntensity vs. Flood Probability')
plt.xlabel('Monsoon Intensity')
plt.ylabel('Flood Probability')
plt.show()
# Scatter plot to visualize the relationship between TopographyDrainage and FloodProbability
plt.figure(figsize=(8, 6))
sns.scatterplot(x='TopographyDrainage', y='FloodProbability', data=df)
plt.title('Scatter plot of TopographyDrainage vs. Flood Probability')
plt.xlabel('Topography Drainage')
plt.ylabel('Flood Probability')
plt.show()
# Scatter plot to visualize the relationship between RiverManagement and FloodProbability
plt.figure(figsize=(8, 6))
sns.scatterplot(x='RiverManagement', y='FloodProbability', data=df)
plt.title('Scatter plot of RiverManagement vs. Flood Probability')
plt.xlabel('River Management')
plt.ylabel('Flood Probability')
plt.show()
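Because the features take a small number of integer values, a scatter plot over more than a million rows stacks points into vertical strips. An alternative view, sketched here under that assumption, is the mean target per feature level. The stand-in frame reuses the first five training rows printed in Step 1.

```python
import pandas as pd

# First five training rows from Step 1 (MonsoonIntensity and the target only)
demo = pd.DataFrame(
    {
        "MonsoonIntensity": [5, 6, 6, 3, 5],
        "FloodProbability": [0.445, 0.450, 0.530, 0.535, 0.415],
    }
)

# Mean target and number of rows for each level of the feature
level_means = demo.groupby("MonsoonIntensity")["FloodProbability"].agg(
    ["mean", "count"]
)
print(level_means)
```

Five rows are far too few to show a trend; on the full data, `df.groupby('MonsoonIntensity')['FloodProbability'].agg(['mean', 'count'])` would give one row per intensity level, which can then be drawn as a line plot.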
Step 4: Applying Data Engineering Steps If Necessary and Creating the Model
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
# Separate the features from the target (note: the 'id' column stays in X as a feature)
X = df.drop(columns=['FloodProbability'])
y = df['FloodProbability']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.10, random_state=42)
# Instantiate the model
model = LinearRegression()
# Train the model
model.fit(X_train, y_train)
# Predict
y_pred = model.predict(X_test)
# Calculate R2 score
r2 = r2_score(y_test, y_pred)
print('------------ACCURACY---------------')
print("R2 score:", r2)
------------ACCURACY---------------
R2 score: 0.8452629126718558
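The value printed under the "ACCURACY" banner is the coefficient of determination, R2 = 1 - SS_res / SS_tot, rather than a classification accuracy. A minimal sketch of that formula follows; the function name `r2_manual` and the predictions are made up for illustration, while the true values are the first five targets from the training preview.

```python
import numpy as np


def r2_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot (illustrative helper)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    return 1.0 - ss_res / ss_tot


# First five targets from the training preview; predictions are invented
y_true_demo = [0.445, 0.450, 0.530, 0.535, 0.415]
y_pred_demo = [0.45, 0.46, 0.52, 0.53, 0.43]

print("R2 for the invented predictions:", r2_manual(y_true_demo, y_pred_demo))

# Always predicting the mean of y_true gives an R2 of (numerically) zero
mean_baseline = [np.mean(y_true_demo)] * len(y_true_demo)
print("R2 for the mean baseline:", r2_manual(y_true_demo, mean_baseline))
```

Read this way, a score of 0 means "no better than predicting the mean", and negative scores mean "worse than the mean".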
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score
# Instantiate the model
dt_model = DecisionTreeRegressor(random_state=42)
# Perform cross-validation
cv_scores_dt = cross_val_score(dt_model, X, y, cv=5, scoring='r2')
# Print cross-validated R2 scores
print("Cross-validated R2 scores for Decision Tree:")
print(cv_scores_dt)
# Calculate mean R2 score from cross-validation
mean_r2_dt = cv_scores_dt.mean()
print("Mean R2 score from cross-validation for Decision Tree:", mean_r2_dt)
Cross-validated R2 scores for Decision Tree:
[-0.01559219 0.04086102 0.05993065 0.05304265 -0.01726839]
Mean R2 score from cross-validation for Decision Tree: 0.024194746856915805
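Fold scores near zero, some of them negative, mean the unconstrained tree does little better than predicting the mean. The source does not establish why. One way to probe it, sketched below on synthetic data, is to compare an unlimited-depth tree with a depth-limited one under shuffled folds. The synthetic features, the noise level, the depth of 8, and the use of shuffled `KFold` are all assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in: five integer features and a noisy additive target
rng = np.random.default_rng(42)
n = 2000
X_demo = pd.DataFrame(
    rng.integers(0, 10, size=(n, 5)), columns=[f"f{i}" for i in range(5)]
)
y_demo = 0.4 + 0.005 * X_demo.sum(axis=1) + rng.normal(0, 0.01, n)

# Shuffled folds, so each fold mixes rows from the whole dataset
folds = KFold(n_splits=5, shuffle=True, random_state=42)

depth_scores = {}
for depth in (None, 8):  # None lets the tree grow until leaves are pure
    tree = DecisionTreeRegressor(max_depth=depth, random_state=42)
    depth_scores[depth] = cross_val_score(
        tree, X_demo, y_demo, cv=folds, scoring="r2"
    )
    print(f"max_depth={depth}: mean R2 = {depth_scores[depth].mean():.4f}")
```

Running the same comparison on the real `X` and `y`, optionally after dropping `id`, would show whether depth or fold construction changes the picture.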
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestRegressor
# Instantiate the model
rf_model = RandomForestRegressor(random_state=42)
# Perform cross-validation
cv_scores_rf = cross_val_score(rf_model, X, y, cv=5, scoring='r2')
# Print cross-validated R2 scores
print("Cross-validated R2 scores for Random Forest:")
print(cv_scores_rf)
# Calculate mean R2 score from cross-validation
mean_r2_rf = cv_scores_rf.mean()
print("Mean R2 score from cross-validation for Random Forest:", mean_r2_rf)
Cross-validated R2 scores for Random Forest:
[0.63997578 0.64729681 0.65098209 0.64962777 0.64144466]
Mean R2 score from cross-validation for Random Forest: 0.645865421956049
# Predict 'FloodProbability' for the test data loaded in Step 1, using the
# linear regression model (the highest-scoring of the three models above);
# select the columns in the same order that was used for training
y_pred_test = model.predict(test_df[X.columns])
# Create a DataFrame with 'id' and predicted 'FloodProbability'
submission_df = pd.DataFrame({'id': test_df['id'], 'FloodProbability': y_pred_test})
# Save the DataFrame to a CSV file
submission_df.to_csv('submission.csv', index=False)
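A linear model is not bounded, so its predictions can in principle fall outside the [0, 1] range of a probability. The sketch below shows one possible pre-submission check that clips predictions and validates the frame; the helper name `build_submission`, the clipping choice, and the example predictions are assumptions for illustration, not part of the original notebook.

```python
import numpy as np
import pandas as pd


def build_submission(ids, preds, low=0.0, high=1.0):
    """Assemble a two-column submission frame (hypothetical helper).

    Predictions are clipped to [low, high]; duplicate ids or missing
    predictions raise a ValueError.
    """
    clipped = np.clip(np.asarray(preds, dtype=float), low, high)
    sub = pd.DataFrame({"id": np.asarray(ids), "FloodProbability": clipped})
    if not sub["id"].is_unique:
        raise ValueError("duplicate ids in submission")
    if sub["FloodProbability"].isna().any():
        raise ValueError("missing predictions in submission")
    return sub


# Test ids from the preview in Step 1; the predictions are invented
sub_demo = build_submission([1117957, 1117958, 1117959], [0.51, -0.02, 1.07])
print(sub_demo)
```

With the notebook's variables, the call would be `build_submission(test_df['id'], y_pred_test)`, followed by `to_csv('submission.csv', index=False)` as above.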